TOGSim C++ trace-generation pipeline (P0-P3): explicit dataflow producer + barriers by YWHyuk · Pull Request #267 · PSAL-POSTECH/PyTorchSim

YWHyuk · 2026-06-19T04:22:35Z

What

Replaces the timing-path TOG producer (MLIR -> Python dict -> ONNX -> C++ TileGraphParser) with a compiled, shape-parametric trace producer: post-vcix MLIR -> skeleton -> EmitC -> C++ -> .so. TOGSim dlopens the .so, runs it to record an instruction trace, and feeds it into the existing Simulator/Core (timing core unchanged). Driven by a new --trace_so mode; the legacy ONNX-TOG path is kept and marked DEPRECATED, so nothing existing breaks.

Pipeline

post-vcix .mlir
  | build_skeleton.py        loops + memref.dma_start/wait -> togsim.* ; DCE the rest
  | dep_analysis.py          per-op read/write SRAM buffers (SSA) + vcix preload/matmul pairing
  | lower_to_emitc.py        togsim.* -> emitc.call_opaque ; drive upstream convert-*-to-emitc
  v
EmitC --mlir-translate--> C++ --g++ -shared--> trace.so
  | run_producer (dlopen)    EmitCtx callbacks record a TraceRec stream
  | togsim_trace_bridge.cc   TraceRec -> TileGraph (explicit dependency DAG)
  v
existing Simulator / Core    cycles, DRAM traffic

Dependency model (no in-order, no runtime tag-hash, no op heuristics)

Dependencies are derived from two sources available pre-collapse:

SRAM last-writer per buffer (load->compute, the Y_spad accumulator chain), recovered via SSA + a virtual SA_WEIGHTS buffer that folds preload->matmul.
The systolic array modeled as a pipeline (occupancy/latency split) with two explicit, distinctly-named barriers:
- MEMORY_BAR (renamed from BAR): the DMA/tag memory fence; a compute that consumes an async load waits until the data's response is complete.
- COMPUTE_BAR (new): the compute fence; a store waits for all systolic-array pipelines to drain.

Both barriers are first-class trace ops (togsim.compute_barrier -> ABI togsim_compute_barrier) visible in the trace dump and the instruction stream.
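
As an illustration of the last-writer rule, the sketch below walks a trace in program order and gives each op an edge from the most recent writer of every buffer it reads. The `Op` class and `link_last_writers` are hypothetical names made up for this example, not code from the PR; the buffer names (`Y_spad`, `SA_WEIGHTS`) follow the description above, and the real dep_analysis pass works on SSA values rather than strings.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical trace op: a name plus the SRAM buffers it reads and writes."""
    name: str
    reads: tuple = ()
    writes: tuple = ()
    deps: set = field(default_factory=set)


def link_last_writers(ops):
    """Add an edge from the last writer of each buffer to every op that reads it."""
    last_writer = {}  # buffer name -> index of the op that wrote it most recently
    for idx, op in enumerate(ops):
        for buf in op.reads:
            if buf in last_writer:
                op.deps.add(last_writer[buf])
        for buf in op.writes:
            last_writer[buf] = idx
    return ops


trace = link_last_writers([
    Op("load_A", writes=("A_spad",)),
    Op("load_B", writes=("B_spad",)),
    # The virtual SA_WEIGHTS buffer folds the preload -> matmul pairing into the same rule.
    Op("preload", reads=("B_spad",), writes=("SA_WEIGHTS",)),
    Op("matmul_0", reads=("A_spad", "SA_WEIGHTS", "Y_spad"), writes=("Y_spad",)),
    Op("matmul_1", reads=("A_spad", "SA_WEIGHTS", "Y_spad"), writes=("Y_spad",)),
])

for op in trace:
    print(op.name, sorted(op.deps))
```

The second matmul depends on the first only through `Y_spad`, which is the accumulator chain mentioned above.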

Status

256^3 GEMM runs end-to-end through the real Simulator via --trace_so.
Cycle comparison vs the legacy build_tog path on the same kernel + gem5 cycle_list: compute work and DRAM traffic match; matmuls pipeline on 2 SAs; the memory fence correctly delays compute until the weight load arrives.
Known open items (documented in docs/design/togsim_cpp_trace.md sec 10): preload-concurrency cap (needs non-zero preload occupancy), parallel output tiles (dispatch granularity), broader op coverage (conv/SDPA/vector).

Testing

tests/test_togsim_skeleton.py, test_togsim_emitc.py, test_togsim_runtime.py (7 tests).
Manual --trace_so GEMM through TOGSim.
Legacy path untouched (comment-only DEPRECATED markers).

Design of record: docs/design/togsim_cpp_trace.md (sec 9-10).

🤖 Generated with Claude Code

Design-of-record + status + handoff for the C++ trace producer: post-vcix MLIR -> skeleton+API -> EmitC -> compiled .so that TOGSim dlopens and feeds to the existing timing Core. Async DMAs pair with explicit memory barriers by the runtime tag slot (tag_id, tag_slot) via the Core tag table; the SRAM-buffer last-writer DAG carries compute dependencies. Validated on the 256^3 GEMM: trace 2518 vs legacy 2698 on the real gem5 cycle table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

One op-walk generator and the one-line attribute builders/readers were copied across the passes. Consolidate into passes/_mlir_util.py (walk_ops; i32/i64/i64_array/str_attr; attr_int/attr_bool/attr_i64_array) and adopt it in lower_to_vcix, decompose_transfer, dma_fine_grained, lower_dma_to_gemmini, lower_vlane_idx. walk_ops needs no MLIR bindings so the module imports mlir.ir lazily; pure functions, no module-global state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
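
A minimal sketch of a binding-free op walker in the spirit of walk_ops. It assumes only that an op exposes `regions`, a region exposes `blocks`, and a block exposes `operations`; the pre-order traversal and the `fake_op` stand-in are assumptions made for this example and may differ from the helper in passes/_mlir_util.py.

```python
from types import SimpleNamespace


def walk_ops(op):
    """Yield op and every op nested under it, parent before children (assumed order)."""
    yield op
    for region in op.regions:
        for block in region.blocks:
            for child in block.operations:
                yield from walk_ops(child)


def fake_op(name, *children):
    """Hypothetical stand-in for an MLIR op, so the walker runs without mlir.ir."""
    block = SimpleNamespace(operations=list(children))
    region = SimpleNamespace(blocks=[block])
    return SimpleNamespace(name=name, regions=[region] if children else [])


module = fake_op(
    "module",
    fake_op("func", fake_op("scf.for", fake_op("memref.dma_start")), fake_op("return")),
)
names = [op.name for op in walk_ops(module)]
print(names)
```

Because the walker only touches attributes by name, it needs no import of the MLIR bindings, which matches the lazy-import note above.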

The compiler half of the trace pipeline. build_skeleton (C2) reduces a post-vcix kernel to a loop skeleton + togsim.* API ops: dma_start -> togsim.dma (tag_id + runtime tag index), dma_wait -> explicit togsim.memory_barrier, compute node -> togsim.compute, then a use-based DCE strips the data math. dep_analysis derives per-op SRAM read/write buffers (the last-writer dependency DAG); cycle_table builds the tile_id->cycle sidecar; lower_to_emitc (C4) rewrites togsim.* to emitc.call_opaque and drives the upstream EmitC pipeline to C++. extension_codecache emits the .so + cycle sidecar opt-in (TORCHSIM_DUMP_TRACE_SO=1), snapshotting the gem5 cycle_list before the legacy TOG consumes it. tog_generator marked DEPRECATED. No static event_id: an async dma pairs with its barrier by the runtime tag slot, since one static op runs once per loop iteration. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

TOGSim side of the trace pipeline. togsim_runtime.{h,cc} is the producer ABI (v11): togsim_dma (void, carries tag_id + tag_slot), togsim_compute, togsim_memory_barrier (the explicit async-DMA sync), togsim_compute_barrier, togsim_core_alloc. togsim_loader records a TraceRec stream; the bridge (togsim_trace_bridge) turns it into a TileGraph: an async dma and its memory_barrier pair by (tag_id, tag_slot) through the Core tag table (set_tag_finish / register_tag_waiter), the barrier becomes the last-writer of the loaded buffer, and the SRAM read/write-buffer DAG drives compute deps with the occupancy/latency systolic-array pipeline + an explicit compute fence before a store. main.cc gains --trace_so/--cycle_table; Instruction/Core gain MEMORY_BAR + COMPUTE_BAR and the pipeline-child model. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

test_togsim_skeleton pins the togsim_ops vocabulary against the ABI header and exercises build_skeleton on a post-vcix fixture (event-id-free output, explicit memory_barrier). test_togsim_emitc builds the .so and checks the EmitC/symbol-table shape + that it runs against a stub runtime. The togsim_runtime test links the real runtime, runs the loader, and checks the recorded trace (resolved addresses, tag-paired barriers, looked-up cycles). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

The .so's exported entry function (the renamed kernel skeleton the loader dlopens and runs) is renamed togsim_emit -> togsim_kernel. Pure rename of the single ENTRY_SYMBOL contract (producer export == loader dlsym); no signature or behavior change. Updated togsim_ops.ENTRY_SYMBOL, the runtime header/loader, lower_to_emitc, the tests' dlsym/nm checks, and the design docs. Left togsim_emitc (the C4 lowering / its test) untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

…ag alloc) The trace bridge's dma tag key has an empty accum component, so it pairs correctly only for a single-tile reduction (the current GEMM). Document the agreed fix for multi-tile-K and conv: hoist the tag memref alloc into the reduction-loop body (coarse, pre-fine-grained DMA) so each reduction iteration gets a fresh tag whose runtime identity is the per-iteration tag_id -- no accum-axis enumeration, works for any reduction depth. Because that alloc dominates both the load and wait nests, dma and memory_barrier pair by the SSA tag handle, with tag_idx kept as the subtile slot. Comment only; no behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

…d tag key) The bridge keyed the Core tag table on the static (tag_id, tag_slot), so the DMAs of successive reduction iterations of one static op shared a key and would collide for multi-tile-K (and conv, reduction = kh*kw*C). Mint a fresh per-DMA-record tag key (uniq) instead, and pair each memory_barrier with the CURRENT load for its (tag_id, tag_slot) -- it is 1 load : N barriers (the load runs once per reduction iteration; each consumer waits the same tag), and the load/consumer nests run in order within the reduction body, so "current load" is correct (not a FIFO). Distinct uniq per load => successive iterations never collide; axis-agnostic, no coordinate enumeration. Single-tile GEMM is unchanged (2518 cycles). FIXME kept: the per-iteration tag is reconstructed here from record order, while the producer IR still carries one static func-entry tag alloc -- the faithful fix is to hoist that memref.alloc into the reduction-loop body and emit a matching per-iteration togsim.tag_alloc threaded by SSA (then uniq is unnecessary). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
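
The "current load" pairing can be sketched as follows. This is an illustration, not the bridge code: records are simplified to `(kind, tag_id, tag_slot)` tuples and `pair_barriers` is a hypothetical helper. Each load mints a fresh key, and every later barrier on the same `(tag_id, tag_slot)` pairs with the most recent load rather than popping from a queue.

```python
import itertools


def pair_barriers(records):
    """Pair each memory_barrier with the current load for its (tag_id, tag_slot).

    records: (kind, tag_id, tag_slot) tuples in trace order.
    Returns (record_index, uniq_key) for every barrier.
    """
    fresh = itertools.count()   # source of per-load unique keys
    current = {}                # (tag_id, tag_slot) -> uniq key of the latest load
    pairs = []
    for i, (kind, tag_id, tag_slot) in enumerate(records):
        static_key = (tag_id, tag_slot)
        if kind == "dma":
            current[static_key] = next(fresh)
        elif kind == "memory_barrier":
            # 1 load : N barriers, so look the key up without consuming it.
            pairs.append((i, current[static_key]))
    return pairs


records = [
    ("dma", 0, 0),             # reduction iteration 0
    ("memory_barrier", 0, 0),
    ("memory_barrier", 0, 0),
    ("dma", 0, 0),             # reduction iteration 1, same static op
    ("memory_barrier", 0, 0),
]
pairs = pair_barriers(records)
print(pairs)
```

Both barriers of the first iteration share key 0, while the second iteration's barrier gets key 1, so successive iterations do not collide even though the static key is identical.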

…conv) A tag memref was allocated once at the func entry and reused by every reduction iteration of a static DMA, so the per-iteration tag identity was only an artifact of the timing path's bridge. Make it real in the IR: when fine-grained splits a matmul load, allocate a fresh tag memref.alloc just before the coarse dma_start and replace_all_uses_with the old tag -- this rewires both the re-emitted dma_start AND its dma_wait, and the coarse dma sits at the reduction-loop body level so the alloc dominates the load and wait nests. Each reduction iteration thus allocates its own tag (distinct for multi-tile-K / conv, no coordinate enumeration); the now-dead func-entry alloc is erased. Sync stores keep their tag. Legacy materializes to a distinct alloc per iteration (its calc_tag accum component becomes redundant); verified the 256^3 GEMM still passes and the trace path is unchanged at 2518 cycles. The bridge FIXME is updated: build_skeleton still collapses the in-loop alloc to one static tag_id, so the bridge's per-record uniq is still what distinguishes iterations until that identity is threaded as an SSA tag handle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

… slot build_skeleton carried the dma_wait tag index verbatim onto togsim.memory_barrier. lower_to_vcix builds that index with a -acc_iv term for each accumulation (reduction) loop var -- a sentinel marking the reduction axis, not an arithmetic offset (legacy TileGraphParser skips stride -1 for the same reason). The matching async load index (dma_fine_grained) is subtile-only, so at reduction iteration > 0 the producer evaluated -acc_iv to a negative slot, the recorded barrier tag_slot diverged from the load slot, and TOGSim aborted with "Key does not exist in subgraph's tag table" on subtile + multi-tile-K. _strip_accum_terms now drops the negative-coefficient dim terms from the wait's affine.apply (composing with a selector that zeros those dims), so the barrier slot is subtile-only and pairs with its load. Reduction iterations are still told apart by the per-iteration tag alloc and the fresh per-record Core key in the bridge, not by the slot. Single-tile kernels (no reduction term) fall through unchanged. Verified: 256x512x256 forced to 128x128 subtiles (2 K-tiles) now runs to 5774 cycles instead of crashing; single-tile 256^3 unchanged. Adds a self-contained regression for _strip_accum_terms. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
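
To make the slot mismatch concrete, here is a small sketch that models an affine index as a map from dimension name to coefficient. The representation and the names `strip_accum_terms_sketch`, `sub_m`, `sub_n`, and `acc_k` are assumptions for illustration only; the real pass rewrites an affine.apply in MLIR.

```python
def evaluate(coeffs, values):
    """Evaluate a linear index: sum of coefficient * loop-variable value."""
    return sum(c * values[dim] for dim, c in coeffs.items())


def strip_accum_terms_sketch(coeffs):
    """Drop negative-coefficient terms, which mark the reduction axis (a sentinel)."""
    return {dim: c for dim, c in coeffs.items() if c >= 0}


# Hypothetical indices: the load slot is subtile-only, the wait carries -acc_k.
load_index = {"sub_m": 2, "sub_n": 1}
wait_index = {"sub_m": 2, "sub_n": 1, "acc_k": -1}

# Second reduction iteration (acc_k = 1) of subtile (0, 0).
ivs = {"sub_m": 0, "sub_n": 0, "acc_k": 1}

load_slot = evaluate(load_index, ivs)
raw_wait_slot = evaluate(wait_index, ivs)
fixed_wait_slot = evaluate(strip_accum_terms_sketch(wait_index), ivs)
print(load_slot, raw_wait_slot, fixed_wait_slot)
```

With the sentinel left in, the barrier asks for slot -1 while the load wrote slot 0; after stripping, both agree.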

Document that the trace tag_slot is subtile-only: build_skeleton strips the lower_to_vcix -acc_iv accumulation marker from the dma_wait index so a memory_barrier pairs with the slot its load wrote, mirroring legacy TileGraphParser's skip of stride -1. Record that subtile + multi-tile-K (256x512x256, 128x128 subtiles, 2 K-tiles) now runs at 5774 cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

The -1 coefficients lower_to_vcix puts on accumulation loop vars in the A/B dma_wait tag indices are a reduction-axis sentinel honored only by the legacy TOG path (TileGraphParser); the trace path strips them in build_skeleton._strip_accum_terms. Document this at both emission sites and note they are kept for byte-identity with the C++ -test-pytorchsim-to-vcix pass and should be removed (not flagged) once legacy retires. Comments only; output is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

…_END Replace the bare togsim_core_alloc marker with a higher-order togsim_dispatch(ctx, tile_fn, iv, n_iv) wrapper. The runtime round-robins a core from the pool, brackets the work-item with TILE_BEGIN/TILE_END trace records, and invokes the producer's outlined tile function. The work-item scope is now exactly the function call, not an implicit "ops until the next core_alloc" range, and one general (kernel-independent) dispatcher serves every kernel via a uniform iv-array tile signature (togsim_tile_fn). Core alloc and the begin/end boundary are runtime-owned; the producer stays core-count transparent. TraceRec gains TILE_BEGIN/TILE_END (drops DISPATCH); the bridge opens a subgraph on TILE_BEGIN (bound to the record's core) and flushes it on TILE_END, and the reference timer treats both as zero-cost boundaries. Verified on the subtile 256x512x256 case: 5774 cycles, identical to the pre-outline core_alloc form. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
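
The dispatch wrapper described above can be sketched in a few lines. This models the idea only (round-robin core choice, TILE_BEGIN/TILE_END bracketing, a uniform tile signature taking an iv array); the `Runtime` class and record tuples are hypothetical, and the real ABI is the C function togsim_dispatch(ctx, tile_fn, iv, n_iv).

```python
class Runtime:
    """Hypothetical runtime that owns core allocation and tile boundaries."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.next_core = 0
        self.trace = []

    def dispatch(self, tile_fn, iv):
        # Pick a core round-robin; the producer never sees the core count.
        core = self.next_core
        self.next_core = (self.next_core + 1) % self.num_cores
        self.trace.append(("TILE_BEGIN", core))
        tile_fn(self, list(iv))          # work-item scope == this call
        self.trace.append(("TILE_END", core))


def kernel_tile(rt, iv):
    """Outlined tile body with the uniform (ctx, iv-array) signature."""
    rt.trace.append(("COMPUTE", tuple(iv)))


rt = Runtime(num_cores=2)
for m in range(2):
    for n in range(2):
        rt.dispatch(kernel_tile, (m, n))

begin_cores = [rec[1] for rec in rt.trace if rec[0] == "TILE_BEGIN"]
print(begin_cores)
```

Because the boundary records come from the wrapper, a consumer can open a subgraph on TILE_BEGIN and flush it on TILE_END without guessing where a work-item ends.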

…spatch lower_to_emitc now outlines the innermost parallel-loop body into a uniform togsim_kernel_tile(ctx, iv, n) func and replaces it with a togsim_dispatch(ctx, togsim_kernel_tile, iv, n) call, instead of inserting a bare togsim_core_alloc marker inline. The dispatcher loop marshals the parallel induction vars (m, n) into an int64 array and passes the tile fn as a verbatim function pointer (#emitc.opaque), so the work-item scope is the tile function body and the runtime wrapper owns the core-alloc + TILE_BEGIN/TILE_END boundary. The outline runs after the togsim.* ops become emitc.call_opaque: it moves the body ops into the tile fn, recovers each parallel index as index_cast(iv[k]) inside it, and remaps the captured ctx / induction vars / constants (Value == is identity; external constants are cloned). Only ctx, the parallel IVs, and constants may be captured (dynamic-shape captures raise -> P4). mlir-to-cpp renders a static togsim_kernel_tile defined before the extern "C" togsim_kernel dispatcher. togsim_ops gains DISPATCH_CALLEE / TILE_SYMBOL (drops CORE_ALLOC_CALLEE). Tests: the emitc/runtime harnesses define togsim_dispatch (calling the tile fn) and the skeleton/emitc contract checks use DISPATCH_CALLEE; the outlined .so builds, dlopens, and runs. Docs updated (outline DONE, ABI v12). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

…e path Model the per-core VMEM/spad as a finite resource so the trace path does not prefetch unboundedly. A load occupies its tile's footprint when it issues and the buffer-version it fills is freed when its last consumer issues (tag-last: the bridge tags only each version's last reader). A load that would overflow the spad does not issue that cycle -- it retries until a consumer frees a tile.
- Instruction: per-load buffer-version id + footprint (from tile_numel*elem_bits); per-consumer list of versions it frees on issue.
- togsim_trace_bridge: group the fine DMAs that fill a coarse tile into one buffer-version (a read closes it -> the next write is a new version), tag the last reader to free it. Tracked buffers are the DMA-loaded ones; the accumulator / virtual SA-weights are never load-written, so they are not charged. The pool persists across work-items (one physical per-core spad).
- Core: per-core sram_used / sram_capacity (= core_spad_size_kb) + a version->bytes map; gate MOVIN issue on free space; release on COMP/MOVOUT issue.
- Simulator::check_frozen: if work remains (running()) but nothing is in flight, the spad is too small to hold a kernel's working set -- error out after a margin (kWedgeThreshold) instead of looping forever.
- core_spad_size_kb config key (default 0 = unset/unlimited).
Only trace-path instructions are gated; legacy TileGraphParser insts keep alloc id -1. Verified: 1024^3 GEMM unchanged at 16 MB (compute-bound); shrinking the spad throttles loads and below one tile-pair the run reports "spad too small" rather than deadlocking.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
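
A minimal sketch of the throttle's bookkeeping, assuming a byte-granular pool and integer version ids. `SpadPool`, `footprint_bytes`, and the bits-to-bytes conversion are illustrative assumptions; the Core in the PR tracks sram_used / sram_capacity and a version->bytes map, whose exact units are not shown here.

```python
def footprint_bytes(tile_numel, elem_bits):
    """Footprint of one tile, assuming bits are packed into whole bytes."""
    return tile_numel * elem_bits // 8


class SpadPool:
    """Hypothetical per-core scratchpad pool; capacity 0 means unlimited."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.version_bytes = {}   # buffer-version id -> bytes currently charged

    def try_issue_load(self, version, nbytes):
        """Charge a load to its buffer-version, or refuse so the caller retries later."""
        if self.capacity and self.used + nbytes > self.capacity:
            return False
        self.version_bytes[version] = self.version_bytes.get(version, 0) + nbytes
        self.used += nbytes
        return True

    def release(self, version):
        """Called when the version's last consumer issues."""
        self.used -= self.version_bytes.pop(version, 0)


tile = footprint_bytes(tile_numel=128 * 128, elem_bits=16)   # 32768 bytes
pool = SpadPool(capacity_bytes=64 * 1024)

issued = [pool.try_issue_load(v, tile) for v in (0, 1, 2)]   # third load must wait
pool.release(0)                                              # last reader of v0 issues
retry = pool.try_issue_load(2, tile)
print(issued, retry, pool.used)
```

Several fine DMAs that fill one coarse tile can charge the same version id, and a single release returns all of their bytes.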

core_spad_size_kb drives the trace-path SRAM throttle; provide it from the config rather than a hardcoded default. TPUv2/v3/v4 VMEM = 16 MB (16384); the 8x8 toy arrays = 128 KB x 8 = 1 MB (1024). stonne/heterogeneous configs are left unset (different accelerator path). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

Convert a `--log_level trace` log into a Chrome Trace Event JSON (open in https://ui.perfetto.dev). Per-core lanes dma / sa / vector; each hardware unit is replayed as a server so real idle gaps show and slices do not nest.
- sa/vector: slice width = compute_cycle - overlapping_cycle (occupancy, tail excluded); --num-sa N splits the SA into sa0..saN-1.
- dma: slice = the request-injection window [INST_ISSUED, ASYNC_DMA_ISSUE]. When a load is blocked from issuing (spad full under the SRAM throttle) its INST_ISSUED is delayed past the engine-free time, so the stall shows as a real idle gap on the dma lane (vs. continuous injection when the spad is large).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
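
For readers unfamiliar with the target format, the sketch below builds a few Chrome Trace Event "complete" slices (`"ph": "X"`) for an SA lane, with the slice width taken as compute_cycle - overlapping_cycle as described above. The helper names, the lane numbering, and the choice to map one cycle to one timestamp unit are assumptions for this example, not the behavior of trace_timeline.py.

```python
import json


def lane_name_event(core, lane_id, lane_name):
    """Metadata event that labels a (pid, tid) lane in the viewer."""
    return {"ph": "M", "name": "thread_name", "pid": core, "tid": lane_id,
            "args": {"name": lane_name}}


def sa_slice(name, core, lane_id, start_cycle, compute_cycle, overlapping_cycle):
    """Complete event whose width is the occupancy, with the drain tail excluded."""
    return {"ph": "X", "name": name, "pid": core, "tid": lane_id,
            "ts": start_cycle, "dur": compute_cycle - overlapping_cycle}


events = [
    lane_name_event(core=0, lane_id=1, lane_name="sa0"),
    sa_slice("matmul", 0, 1, start_cycle=100, compute_cycle=64, overlapping_cycle=16),
    sa_slice("matmul", 0, 1, start_cycle=148, compute_cycle=64, overlapping_cycle=16),
]
trace_json = json.dumps({"traceEvents": events}, indent=2)
print(trace_json)
```

Since each slice covers only the occupancy, back-to-back matmuls on one lane abut instead of nesting, and any gap between them is real idle time.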

Model a systolic array's weight registers as a finite resource so a preload cannot run unboundedly ahead of the matmuls that consume its weight, and pin each matmul to the SA its weight was preloaded into (weight-stationary locality). Previously every COMP was round-robined independently, so preloads batched far ahead (16-deep) and freed the B-spad they read too early. A preload acquires a weight slot on a systolic array that has a free one (round-robin among free); if all are full it does not issue that cycle and retries. It pins its matmul consumers (its pipeline children) to that SA and gives them a shared token. A matmul frees its slot when it is done READING the weight -- the streaming phase, finish_cycle - overlapping_cycle -- not at full finish: the drain tail flushes results without touching the weight, so releasing at finish would hold the slot through the tail and stall the next double-buffered preload (a visible SA bubble, ~2% inflated cycles). The release is scheduled at issue into a per-core cycle-keyed queue drained before dispatch; the last consumer frees the slot. A preload with no matmul consumers is left alone, so paths without preload->matmul occupancy edges (legacy TileGraphParser uses only add_child) keep unbounded round-robin / infinite weights.
- Instruction: WeightToken {sa, refcount} + per-op _assigned_sa.
- Core: per-SA weight-slot counts (cap = sa_weight_buffer_depth); pick_free_weight_sa for the preload gate + SA choice; matmul runs on its weight's SA (rr fallback); _weight_release_q + process_weight_releases() for the streaming-end release.
- SimulationConfig/Common: sa_weight_buffer_depth (default 2 = weight double-buffer, a convention/tunable, not a verified per-gen constant).
1024^3 GEMM is compute-bound and unchanged (48184 == unbounded baseline), with preloads now paced 1:8 with matmuls and SA lanes balanced 288/288; spad/weight-bound cases tighten. Legacy path verified unaffected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
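
The slot accounting can be sketched as below. This is a simplified model under stated assumptions: one matmul consumer per preload (so no refcount), a heap as the cycle-keyed release queue, and hypothetical names (`WeightSlots`, `try_preload`, `schedule_release`) that do not come from the PR.

```python
import heapq


class WeightSlots:
    """Hypothetical per-core weight-slot tracker for a set of systolic arrays."""

    def __init__(self, num_sa, depth=2):
        self.free = [depth] * num_sa   # free weight slots per SA
        self.rr = 0                    # round-robin start position
        self.release_q = []            # min-heap of (release_cycle, sa)

    def process_releases(self, now):
        """Return slots whose streaming phase has ended; run before dispatch."""
        while self.release_q and self.release_q[0][0] <= now:
            _, sa = heapq.heappop(self.release_q)
            self.free[sa] += 1

    def try_preload(self):
        """Pick an SA with a free slot (round-robin among free), or None to retry."""
        n = len(self.free)
        for off in range(n):
            sa = (self.rr + off) % n
            if self.free[sa] > 0:
                self.free[sa] -= 1
                self.rr = (sa + 1) % n
                return sa
        return None

    def schedule_release(self, sa, finish_cycle, overlapping_cycle):
        """Free the slot at the end of streaming, not at full finish."""
        heapq.heappush(self.release_q, (finish_cycle - overlapping_cycle, sa))


slots = WeightSlots(num_sa=1, depth=2)
first, second, third = slots.try_preload(), slots.try_preload(), slots.try_preload()
slots.schedule_release(sa=0, finish_cycle=100, overlapping_cycle=30)
slots.process_releases(now=69)
too_early = slots.try_preload()
slots.process_releases(now=70)
after_stream = slots.try_preload()
print(first, second, third, too_early, after_stream)
```

Releasing at cycle 70 instead of 100 is what lets the next double-buffered preload issue during the drain tail.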

… uses it The trace timeline drew matmul/preload onto sa0/sa1 by round-robin of issue order, which no longer matches reality now that the weight-buffer throttle pins each matmul to the SA its weight was preloaded into. That made a store look like it issued before a (mis-assigned) SA lane finished, when the model is in fact correct (the compute barrier drains all SAs before the store issues). Expose the SA the Core actually used (it is already recorded on the instruction by the throttle):
- CoreTraceLog: add sa=<idx> to the COMP issue/finish detail line (-1 for vector).
- trace_timeline.py: place each SA op on the lane it reports (sa= field) and auto-split sa0..saN from it; round-robin stays as the fallback for older logs.
Lanes now reflect the real per-SA schedule and the store cleanly follows both SAs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

A core running several work-items (dispatches) over-serialized: the COMPUTE_BAR before a store finished only once EVERY systolic array + the VPU had drained, regardless of which dispatch's matmuls occupied them. So one tile's store waited for an unrelated later tile's compute too -- on a 2-core 2048^3 GEMM, tile 0's store issued at 71619 (after tile 2's compute) when tile 0's own compute finished at 37536. Make the fence drain only the computes it gates: each async compute, when it issues, feeds its finish_cycle to its COMPUTE_BAR pipeline-child (update_fence_finish, folded into the existing release_pipeline_children loop so no extra pass), and the bar finishes once core_cycle reaches that max -- independent of other dispatches sharing the SA pipelines.
- Instruction: _fence_finish + update_fence_finish/get_fence_finish. Also carries a per-op work-item id (_tile_group) used by the trace/timeline in the next commit.
- Core: COMPUTE_BAR waits core_cycle >= fence_finish instead of all-SA-empty; finish_cycle is computed before release_pipeline_children so the fence is fed.
2048^3 (2 tiles/core): tile 0's store now issues at 38826 (right after its own compute) and stores overlap the next tile -> 91648 -> 81883 (~10.7%). Compute-bound 4096^3 unchanged (442023). Legacy unaffected (builds no COMPUTE_BAR).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
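
The per-fence rule reduces to tracking a running maximum. The sketch below uses made-up cycle numbers and a hypothetical `ComputeFence` class; only the method name update_fence_finish and the `core_cycle >= fence_finish` condition come from the description above.

```python
class ComputeFence:
    """Hypothetical COMPUTE_BAR that drains only the computes that feed it."""

    def __init__(self):
        self.fence_finish = 0

    def update_fence_finish(self, finish_cycle):
        # Each gated compute reports its finish cycle when it issues.
        self.fence_finish = max(self.fence_finish, finish_cycle)

    def is_done(self, core_cycle):
        return core_cycle >= self.fence_finish


# Two work-items share the same SA pipelines (illustrative cycle numbers).
fence_tile0, fence_tile1 = ComputeFence(), ComputeFence()
for finish in (100, 120):
    fence_tile0.update_fence_finish(finish)
fence_tile1.update_fence_finish(300)

# At cycle 120 the SAs are still busy with tile 1, yet tile 0's store may go.
print(fence_tile0.is_done(120), fence_tile1.is_done(120))
```

Under the earlier all-SA-empty condition, tile 0's store in this toy case would have waited until cycle 300.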

Emit the dispatch work-item id on every trace instruction and use it in the timeline viewer, and rework the DMA lanes so individual loads are legible.
- togsim_trace_bridge: stamp each instruction with its work-item index (cur_tile_group, bumped per TILE_BEGIN).
- CoreTraceLog: add tile= to the COMP / DMA / MEMORY_BAR detail lines (-1 for legacy, which has no work-item).
- trace_timeline.py:
  - color each slice by its tile (work-item) so one output tile's load / preload / matmul / store share a color across lanes and cores;
  - split the single dma lane into mvin / mvin-r / mvout: injection [issued, async] vs response [async, data-ready], so DRAM-response timing (shared-bandwidth contention) is visible separately from per-core injection;
  - serialize the injection on one DMA engine (server replay) so a load's bar is the engine time it actually uses, not iss->async inflated by queue wait;
  - label each DMA slice by its own addr_name so input / weight / K-panel loads stay distinct (tile is conveyed by color).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

The timeline needs the moment data starts arriving, not just when the last byte lands (DRAM_RESP_DONE). The trace only had ASYNC_DMA_ISSUE (all requests injected) and the final response, so a load's data window had to start at injection-done, missing data that returns while the load is still injecting. Emit DRAM_RESP_FIRST the first time an op's DRAM response arrives (a one-shot flag on the Instruction, set in push_memory_response). The viewer then draws a load's read-bandwidth bar from its first response to data-ready -- the real data-arrival window, including bytes that came back during injection.
- TraceLogTags: kFirstDramResponse = "DRAM_RESP_FIRST".
- Instruction: _got_first_response one-shot flag + got/mark accessors.
- Core: log it on the first push_memory_response of an op.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me
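
The one-shot flag amounts to the following pattern. `LoadInstruction` and the list-based log are hypothetical simplifications; only the tag string DRAM_RESP_FIRST and the idea of setting the flag on the first push_memory_response come from the commit message.

```python
class LoadInstruction:
    """Hypothetical load op that logs only its first DRAM response."""

    def __init__(self, name):
        self.name = name
        self.got_first_response = False   # one-shot flag

    def push_memory_response(self, cycle, log):
        if not self.got_first_response:
            self.got_first_response = True
            log.append(("DRAM_RESP_FIRST", self.name, cycle))


log = []
load = LoadInstruction("load_A")
for cycle in (210, 214, 218):             # three responses for the same op
    load.push_memory_response(cycle, log)
print(log)
```

Only the response at cycle 210 is logged, which gives the viewer the left edge of the load's data-arrival window.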

Rework the DMA view along GPU-profiler lines, replacing the load-lifetime bars (which overlapped and folded in queue wait) with bandwidth-resource lanes:
- dram-rd: a load's read bar [first DRAM response, data-ready] = the real data-arrival window (uses DRAM_RESP_FIRST), serialized on the aggregate bandwidth so each load is one visible bar (packed row = saturated bus).
- dram-wr: a store's write bar [issued, finished] -- writes go out with the request (fire-and-forget; acks land after the store has finished), so this, not the ack window, is the transfer.
- drop the dma-eng injection lane (the engine queue is not the bottleneck) and the in-flight counter.
Each DMA slice keeps its own addr_name label and tile color, so input / weight / K-panel loads stay distinct and one output tile's ops share a color. A saturated dram-rd against a half-idle SA reads as memory-bound at a glance: the 2-core 4096^3 GEMM shows dram-rd ~100% / SA ~59%, and doubling DRAM channels flips it (SA ~100%, 442023 -> 277603 cycles).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0183wNdfEEdoNSjYKitFB9Me

The minimal Instruction(Opcode) constructor used for barriers (MEMORY_BAR, COMPUTE_BAR) left ready_counter uninitialized, while the full constructor sets it from num_parents. A barrier accumulates its count via inc_ready_counter from each parent starting from that garbage value, so dec_ready_counter never returns to 0 unless the garbage was already 0. The barrier then never becomes ready and the kernel never completes -- the frozen-state guard fires with a misleading "spad too small" message. Whether the garbage was 0 depended on process memory layout (env size such as the presence of TORCHSIM_DIR shifts it), making the wedge a non-deterministic heisenbug. Give ready_counter a default member initializer of 0 so the barrier ctor starts from a correct base. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f

Emit the trace producer .so on every compile (best-effort) and drive the standalone TOGSim run from it by default; the legacy ONNX TOG is now the opt-in fallback via TORCHSIM_LEGACY_TOG=1. Previously the .so was emitted only under TORCHSIM_DUMP_TRACE_SO=1 and run only under TORCHSIM_RUN_TRACE=1, so the existing test suite never exercised the C++ TOG. Autotune candidates still run legacy (the .so is a single tiling); the trace path drives the final chosen-tiling run, and a missing .so falls back to legacy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f

codegen_nodes unpacked self.autotune()[:2] into (optimal_src_code, meta_code) when the strategy is autotune and timing mode is on. autotune returns [None, None, None] when it cannot autotune -- e.g. a size-1 pointwise kernel whose ranges == [1], so make_choices yields no candidates -- which clobbered the valid meta_code (the kernel's arg_attributes) with None. The fall-through then returned that None, so the generated wrapper passed arg_attributes=None to the cycle-sim caller and MLIRKernelCallerCodeGen crashed on len(None) (e.g. test_add with a functional-off timing config). Unpack into a temporary so the original meta_code survives the no-autotune case. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f
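
The shape of the fix can be shown with a stand-alone sketch. `pick_code` and the stub autotuners are hypothetical stand-ins for the codegen_nodes logic; the point is only that the autotune result is unpacked into temporaries so a `[None, None, None]` return cannot overwrite the valid meta_code.

```python
def autotune_unavailable():
    """Stub: what autotune returns when there are no candidates."""
    return [None, None, None]


def autotune_ok():
    """Stub: a successful autotune result (illustrative values)."""
    return ["tuned_src", {"arg_attributes": ["tuned"]}, 123]


def pick_code(src_code, meta_code, autotune):
    # Unpack into temporaries instead of overwriting src_code / meta_code.
    tuned_src, tuned_meta = autotune()[:2]
    if tuned_src is not None:
        return tuned_src, tuned_meta
    return src_code, meta_code            # original meta_code survives


original_meta = {"arg_attributes": ["x", "y"]}
fallback = pick_code("orig_src", original_meta, autotune_unavailable)
tuned = pick_code("orig_src", original_meta, autotune_ok)
print(fallback, tuned)
```

In the no-autotune case the caller still receives the original arg_attributes, so a later `len()` on them does not see None.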

The trace path returned without printing the "Simulation finished" marker that the Python result parser (TOGSimulator.get_result_from_file) searches for, so it warned "Unable to parse the output file" and returned inf metrics. Print the marker before the core stats, matching the legacy path's order. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f

Several Instruction members had no default initializer and were left as garbage by the minimal Instruction(Opcode) barrier constructor (and overlapping_cycle even by the full constructor): compute_cycle, overlapping_cycle, start_cycle, finish_cycle, subgraph_id, dram_addr, _tile_numel. Reading garbage from these in occupancy / weight-release timing produces memory-layout-dependent wedges that only surface under some process layouts -- the same heisenbug class as the ready_counter fix. Give them all a default initializer of 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01QwQDvWo2McMjTEuNWQSA4f

YWHyuk force-pushed the feature/togsim-cpp-trace branch 2 times, most recently from cc507fd to f5e8e55 Compare June 19, 2026 08:12

YWHyuk and others added 23 commits June 22, 2026 21:13

YWHyuk force-pushed the feature/togsim-cpp-trace branch from 1151f6a to 7f70bbb Compare June 22, 2026 12:13

YWHyuk and others added 4 commits June 22, 2026 22:46

YWHyuk wants to merge 28 commits into develop from feature/togsim-cpp-trace
